Evaluating gender bias in large language models in long-term care

Sam Rickman
BMC Medical Informatics and Decision Making
Care Policy and Evaluation Centre, LSE, London WC2A 2AE, UK

Overall Summary

Study Background and Main Findings

This study investigates gender bias in large language models (LLMs) used for summarizing long-term care records. The researchers evaluated two state-of-the-art, open-source LLMs released in 2024, Meta's Llama 3 and Google's Gemma, alongside older benchmark models (T5 and BART). They used a 'counterfactual fairness' approach, creating gender-swapped versions of 617 real-world care records and comparing the summaries generated by each model for the male and female versions. Bias was quantified through sentiment analysis, thematic comparisons (e.g., frequency of health-related terms), and word-level analysis.
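
The counterfactual setup can be illustrated with a minimal rule-based swap. The study itself automated this step with Llama 3, partly because simple substitution rules mishandle ambiguous tokens (for instance, possessive versus object 'her'); the mapping below is a hypothetical illustration, not the authors' procedure.

```python
# Minimal sketch of counterfactual gender-swapping. Word list is
# illustrative only; the study used Llama 3 to perform this step.
import re

SWAPS = {
    "mrs": "Mr", "ms": "Mr", "mr": "Mrs",
    "she": "he", "her": "his", "he": "she", "his": "her",
    "woman": "man", "man": "woman", "lady": "gentleman",
    "gentleman": "lady", "female": "male", "male": "female",
}

def gender_swap(text: str) -> str:
    """Replace gendered tokens with their counterparts, preserving case."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped.lower()
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

original = "Mrs Smith is an 87 year old, white British woman. She lives alone."
print(gender_swap(original))
# -> "Mr Smith is an 87 year old, white British man. He lives alone."
```

An LLM-based swap becomes necessary once pronoun ambiguity or gendered names enter the picture, which is precisely the limitation of a lookup table like this one.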

The results revealed a stark contrast in bias across the models. Llama 3 showed no discernible gender bias across any metric. Gemma, however, exhibited significant bias, producing more negative summaries for men and focusing more on their physical and mental health issues. Gemma's summaries also used more direct language when describing men's conditions (e.g., "disabled"), while often downplaying women's needs and using more euphemistic language (e.g., "requires assistance"). The older models, T5 and BART, showed moderate levels of bias, including the addition of negative judgments for female subjects and stereotypical framing of needs.

The study's analysis went beyond simply identifying bias; it also investigated the nature of the bias. A key finding was that Gemma's bias stemmed primarily from the omission of information for women, rather than the fabrication of information for men. For example, specific diagnoses listed in male summaries were often replaced with vague terms like "health complications" in the corresponding female summaries. This suggests that the model systematically underrepresents the severity and complexity of women's health needs. The researchers also found that Gemma often framed summaries about women in a more indirect, meta-narrative style (e.g., "The text describes..."), further distancing the summary from the person and potentially diminishing the impact of their needs.

The study concludes that biased LLM outputs, particularly those from Gemma, pose a tangible risk of creating gender-based disparities in care allocation. Since services are based on documented need, summaries that underemphasize women's health issues could lead to unequal access to care. The researchers argue for the importance of rigorous, model-specific bias evaluation before deploying LLMs in clinical settings and recommend that regulators mandate such evaluations. The study's methodological framework, including the counterfactual analysis and multi-pronged bias quantification, is presented as a practical tool for researchers and practitioners to conduct similar evaluations.

Research Impact and Future Directions

This study makes a valuable contribution to the growing body of research on bias in large language models (LLMs). By focusing on the specific context of long-term care and using a rigorous, interpretable methodology, it provides compelling evidence of how gender bias can manifest in LLM-generated summaries, even in state-of-the-art models. The stark contrast between Llama 3 and Gemma highlights the critical need for model-specific bias evaluation before deployment in real-world healthcare settings. While the study acknowledges its limitations, particularly regarding input text length and the generalizability of its findings, its methodological framework offers a practical and reproducible tool for future research in other healthcare domains and across different protected characteristics.

The paper's strength lies not only in its empirical findings but also in its clear articulation of the potential for 'allocational harm' arising from biased LLM outputs. By connecting subtle linguistic differences to tangible consequences for care provision, it underscores the urgency of addressing bias in AI-driven healthcare tools. The call for regulatory mandates on bias measurement is a logical and actionable policy recommendation that could have a significant impact on ensuring equitable access to care. While the study focuses on detection and characterization, future work could leverage its findings to develop targeted bias mitigation strategies, paving the way for more equitable and responsible AI integration in healthcare.

Critical Analysis and Recommendations

Well-Structured Abstract (written-content)
Clear, logical structure following the standard format makes the study's core message easily accessible.
Section: Abstract
Specific and Impactful Findings (written-content)
Specific model comparisons and detailed descriptions of bias enhance the impact and understandability of the findings.
Section: Abstract
Quantify the Magnitude of Bias (written-content)
Adding a key quantitative metric would strengthen the initial impact by providing a concrete measure of the effect size.
Section: Abstract
Name the Methodological Framework (written-content)
Explicitly naming the counterfactual fairness methodology would enhance clarity and memorability.
Section: Abstract
Effective Funnel Structure (written-content)
The 'funnel' structure, from broad context to specific research questions, makes the argument clear and compelling.
Section: Introduction
Precise Bias Definitions (written-content)
Precise definitions of bias types ('representational', 'allocational') provide a robust theoretical foundation.
Section: Introduction
Timely and Relevant Context (written-content)
Referencing recent political initiatives and model releases establishes the study's timeliness and relevance.
Section: Introduction
Foreshadow Key Findings (written-content)
Foreshadowing the stark model performance differences would create narrative tension and further justify model-specific evaluation.
Section: Introduction
Introduce Core Methodology Earlier (written-content)
Introducing 'counterfactual fairness' earlier would strengthen the claim of methodological contribution.
Section: Introduction
Rigorous Counterfactual Data (written-content)
Using Llama 3 for gender-swapping and validating for identical sentence structure ensures a highly controlled analysis.
Section: Materials and methods
Comprehensive Statistical Analysis (written-content)
The multi-faceted analysis, combining sentiment, thematic, and word-level comparisons, provides a comprehensive bias assessment.
Section: Materials and methods
Bias-Free Measurement Tools (written-content)
Pre-validating sentiment metrics for inherent bias ensures measured differences are attributable to the LLMs.
Section: Materials and methods
Justify Llama 3 for Pre-processing (written-content)
Justifying the choice of Llama 3 for pre-processing would address potential concerns about circularity.
Section: Materials and methods
Refine Thematic Word List Methodology (written-content)
More detail on the manual refinement of thematic word lists would enhance transparency and reproducibility.
Section: Materials and methods
Clear Illustration of Gender-Swapping (graphical-figure)
The clear two-column format effectively illustrates the gender-swapping process.
Section: Materials and methods
Expand on Data Transformation Details (graphical-figure)
Adding details about automation and more complex examples would enhance replicability and comprehensiveness.
Section: Materials and methods
Robust Evidence Triangulation (written-content)
Triangulating quantitative findings with qualitative examples makes the results concrete and interpretable.
Section: Results
Clear and Logical Structure (written-content)
The logical progression from general sentiment to specific word-level analysis makes the argument easy to follow.
Section: Results
Alternative Explanations Addressed (written-content)
Ruling out hallucinations strengthens the conclusion that Gemma's bias is due to omission, not fabrication.
Section: Results
Create Narrative Personas (written-content)
Synthesizing word-level findings into narrative personas would provide a more holistic interpretation of Gemma's bias.
Section: Results
Visualize Comparative Bias (written-content)
A summary figure visualizing comparative bias would enhance accessibility and highlight key findings.
Section: Results
Real-World Implications of Bias (written-content)
Linking linguistic bias to 'allocational harm' and providing concrete examples makes the implications clear and urgent.
Section: Discussion
Transparent Limitations Discussion (written-content)
The nuanced discussion of limitations, including methodological trade-offs and statistical vs. practical significance, enhances credibility.
Section: Discussion
Value of Interpretable Framework (written-content)
Articulating the value of an interpretable framework positions the methodology as a key contribution.
Section: Discussion
Investigate Model Disparity Causes (written-content)
Exploring potential causes for model performance differences would generate hypotheses for future research.
Section: Discussion
Propose Bias Mitigation Strategies (written-content)
Proposing a bias mitigation pathway would enhance the paper's practical contribution and provide a roadmap for solutions.
Section: Discussion
Clear and Decisive Summary (written-content)
The concise summary of model performance and bias types provides a strong, memorable takeaway.
Section: Conclusion
Actionable Policy Recommendation (written-content)
The recommendation to mandate bias measurement is an actionable policy implication based on the findings.
Section: Conclusion
Re-emphasize Interpretable Framework (written-content)
Reiterating the value of the interpretable framework would strengthen the conclusion's emphasis on the methodological contribution.
Section: Conclusion
Articulate a Vision for Equitable AI (written-content)
Concluding with a forward-looking vision for equitable AI would provide a more inspiring final thought.
Section: Conclusion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Materials and methods

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 1 Examples of paired sentences used as input to summarisation models
Figure/Table Image (Page 5)
First Reference in Text
See Table 1 for examples of such changes.
Description
  • Illustrates the gender-swapping process for input data: This table provides concrete examples of how the study's input data was created. It features two columns, 'Original' and 'Gender swapped', showing pairs of sentences. The 'Original' column contains sentences from long-term care records describing female individuals. The 'Gender swapped' column presents the corresponding sentences where all gender-specific words and titles (e.g., 'Mrs', 'woman', 'lady') have been changed to their male equivalents (e.g., 'Mr', 'man', 'gentleman'). This process is a key part of the study's methodology, known as a counterfactual analysis, which aims to test for bias by creating pairs of texts that are identical in every way except for the gender of the person described.
  • Provides specific examples of textual changes: The table shows two distinct examples of the transformation. In the first, a sentence describing 'Mrs Smith is an 87 year old, white British woman' is altered to 'Mr Smith is an 87 year old, white British man'. In the second, a sentence about 'Mrs Jones is an older lady' is changed to 'Mr Jones is an older gentleman'. These examples clearly show the direct, one-to-one substitutions made to prepare the data for the language models.
Scientific Validity
  • ✅ Demonstrates a methodologically sound counterfactual setup: The table effectively illustrates the implementation of the counterfactual fairness framework. By presenting paired sentences that differ only in gendered terms, it transparently shows how the input data was prepared to isolate gender as the variable of interest. This is a robust and appropriate method for testing the specific type of bias being investigated.
  • 💡 Lacks detail on the automation and scope of the swapping process: While the examples are clear, they are simple. The text mentions that Llama 3 was used to automate this process, which is a critical detail for replicability. The table or its caption could be enhanced by briefly noting that this was an automated process and providing insight into the rules or prompts used. Furthermore, including an example that involves more complex changes, such as pronoun substitution ('she'/'her' to 'he'/'him') within a longer sentence, would offer a more comprehensive view of the data transformation's robustness.
Communication
  • ✅ Excellent clarity and simplicity: The two-column 'Original' vs. 'Gender swapped' format is exceptionally clear and intuitive. The chosen examples are straightforward and effectively communicate the core concept of the data manipulation without any distracting complexity. This makes the methodology immediately understandable.
  • ✅ Highly self-contained and supportive of the main text: The table, combined with its descriptive caption, functions well as a standalone element. A reader can quickly grasp the fundamental data preparation step just by viewing the table, which perfectly complements the methodological description in the main text by providing a concrete illustration.
  • 💡 Minor formatting could enhance readability: To improve the efficiency of communication, consider using bold formatting to highlight the specific words that were changed in the 'Gender swapped' column (e.g., '**Mr** Smith', '**man**'). This visual cue would allow readers to instantly identify the exact substitutions without needing to read and compare the entire sentences, making the changes more salient.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 2 Effect of gender and explanatory variables on sentiment (mixed effects model)
Figure/Table Image (Page 6)
First Reference in Text
Table 2 presents the estimates from the mixed effects model.
Description
  • Presents statistical model results for two sentiment metrics: This table displays the output of a 'mixed-effects model', a statistical tool used to analyze data with complex structures, such as multiple summaries originating from the same document. The analysis is split into two sections, one for each sentiment analysis metric used: 'Regard' and 'SiEBERT'. These metrics are computational methods that assign a numerical score to a piece of text to quantify its emotional tone. The table shows how different factors, or 'explanatory variables', are associated with the sentiment scores of the generated summaries.
  • Quantifies the effect of LLM model, gender, and summary length on sentiment: The table lists several variables and their estimated effects ('Estimate') on sentiment. These variables include the specific Large Language Model (LLM) used (Gemma, Llama3, T5), the gender of the subject ('gendermale'), and the maximum length of the summary ('Max tokens'). A key part of the table is the set of 'interaction terms', such as 'Model gemma: Male'. This term specifically tests whether the effect of gender on sentiment is different when using the Gemma model compared to a baseline model (BART).
  • Highlights significant gender-based differences for the Gemma model: The most notable results are the interaction effects. For the Gemma model, the interaction with male gender ('Model gemma: Male') shows a statistically significant negative estimate for both the 'Regard' metric (Estimate = -0.0110, p = 4.5e-05) and the 'SiEBERT' metric (Estimate = -0.0330, p = 1.0e-07). This indicates that, compared to the baseline, the Gemma model produces summaries about men with a more negative sentiment score than summaries about otherwise identical women. In contrast, the 'Model llama3: Male' interaction is not significant for 'Regard' and shows a small positive effect for 'SiEBERT', suggesting Llama 3 does not exhibit the same pattern of bias as Gemma.
Scientific Validity
  • ✅ Appropriate statistical model selection: The use of a mixed-effects model is highly appropriate for this experimental design. It correctly accounts for the non-independence of observations that arise from generating multiple summaries (with different models and parameters) from the same source document, by treating document ID as a random effect. This increases the statistical validity of the findings.
  • ✅ Robust testing of the primary hypothesis via interaction terms: The inclusion of interaction terms (e.g., 'Model gemma: Male') is the correct and most rigorous way to test the central research question of whether gender bias varies across different LLMs. This approach provides more nuanced and powerful evidence than simply examining the main effect of gender alone.
  • ✅ Comprehensive reporting of statistical results: The table provides all the necessary components for a full evaluation of the model results: the coefficient estimates, standard errors, t-statistics, and precise p-values. This level of transparency allows for independent assessment and interpretation of the model's findings.
  • 💡 The complexity of interaction effects can be challenging to interpret directly: While methodologically sound, interpreting the coefficients for interaction terms in the context of reference levels can be non-intuitive. The authors rightly supplement this table with an analysis of estimated marginal means (Table 3) in the text to make the findings clearer. This is good practice, as the main table's results require careful interpretation to understand the magnitude and direction of effects for each model relative to the baseline.
Communication
  • ✅ Clear side-by-side structure for comparing metrics: The table is well-organized, presenting the results for the 'Regard' and 'SiEBERT' models in parallel columns. This layout makes it very easy for the reader to compare the findings across the two different sentiment analysis methods.
  • 💡 Lacks explicit definition of reference categories: The table is not fully self-contained because it omits the reference levels for the categorical variables. The reader must refer to the methods section to know that the baseline model is BART, the reference gender is female, and the reference token length is 50. Adding a footnote to the table stating these reference categories would significantly improve clarity and make the table easier to interpret on its own.
  • 💡 Inconsistent formatting of p-values: The p-values are presented in multiple formats (e.g., '0.0e+00', '4.5e-05', '5.1e-02'). For better readability and consistency, it would be preferable to standardize this. For instance, use '< 0.001' for very small values and report others to three decimal places (e.g., 'p = 0.051').
  • 💡 Important interaction terms could be visually emphasized: The key findings of the table are the interaction terms listed at the bottom. To help guide the reader's attention to these crucial results, consider adding a horizontal line or a sub-header to visually separate the main effects from the interaction effects.
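
The mixed-effects specification described above can be sketched with statsmodels (the paper's own analysis used R); the data, variable names, and effect sizes below are synthetic and purely illustrative of the design: a model-by-gender interaction with document ID as a random intercept.

```python
# Sketch of the mixed-effects design: sentiment ~ model * gender, with a
# random intercept per source document. Synthetic data; a built-in pro-male
# negativity is injected for "gemma" so the interaction term is visible.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for doc in range(100):
    doc_effect = rng.normal(0, 0.05)          # random intercept per document
    for model in ["bart", "gemma", "llama3", "t5"]:
        for gender in ["female", "male"]:
            bias = -0.03 if (model == "gemma" and gender == "male") else 0.0
            rows.append({"doc_id": doc, "model": model, "gender": gender,
                         "sentiment": 0.5 + doc_effect + bias
                                      + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)

# Fixed effects for model, gender, and their interaction; random intercept
# grouped by document, mirroring the repeated-summaries structure.
fit = smf.mixedlm("sentiment ~ model * gender", df, groups=df["doc_id"]).fit()
print(fit.summary())
```

With treatment coding, the coefficient named `model[T.gemma]:gender[T.male]` plays the role of the 'Model gemma: Male' row in Table 2: it recovers the injected negative shift for male subjects under that model relative to the baseline.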
Table 3 Estimated marginal mean effect of gender on sentiment (female - male)
Figure/Table Image (Page 6)
First Reference in Text
As the coefficients and p values in Table 2 are compared with reference levels, which can be challenging to interpret, Table 3 includes the estimated marginal means by gender for each of the models, calculated using the emmeans R package [54].
Description
  • Summarizes the net effect of gender on sentiment for each language model: This table simplifies the complex statistical results from Table 2. It shows the 'estimated marginal mean' effect, which is a statistical way of calculating the average difference in sentiment scores between summaries about females and summaries about males, for each of the four language models tested. The 'Estimate' column represents this difference (female score minus male score), so a positive number indicates that, on average, summaries for females had a more positive sentiment.
  • Highlights a significant pro-female sentiment bias in the Gemma model: The most prominent finding is for the 'gemma' model. For both sentiment metrics used ('Regard' and 'SiEBERT'), the estimate is positive and highly statistically significant. For the SiEBERT metric, the estimate is 0.0420 with a p-value of 0.000, indicating a strong and consistent tendency for Gemma to generate summaries with more positive sentiment for female subjects compared to male subjects.
  • Shows no significant gender-based sentiment difference for the Llama 3 model: In contrast to Gemma, the 'llama3' model shows no statistically significant difference in sentiment based on gender. The p-values for both the 'Regard' (p=0.250) and 'SiEBERT' (p=0.200) metrics are large, suggesting that any observed difference in sentiment scores for Llama 3 is likely due to random chance.
  • Reveals mixed and smaller effects for the benchmark models: The older models, 'bart' and 't5', show statistically significant but smaller and less consistent effects. For instance, the 't5' model shows a negative estimate for both metrics, suggesting a slight pro-male sentiment bias (Regard Estimate = -0.0049; SiEBERT Estimate = -0.0100), which is the opposite pattern to the Gemma model.
Scientific Validity
  • ✅ Excellent use of estimated marginal means for interpretability: Presenting the estimated marginal means (or the difference between them) is a methodologically sound and highly effective way to interpret the significant interaction effects observed in Table 2. It translates the complex regression coefficients into a direct, meaningful comparison of gender effects for each model, which is the core of the research question.
  • ✅ Provides strong, direct evidence for the paper's conclusions: The table offers clear, quantitative support for the central finding that gender bias varies significantly across different LLMs. The stark contrast between the highly significant bias in Gemma and the lack of significant bias in Llama 3 is a powerful result that is well-supported by the statistical evidence presented.
  • 💡 The caption could be slightly more precise: The caption 'Estimated marginal mean effect of gender on sentiment (female - male)' is functional but could be misinterpreted. A more precise caption would be 'Difference in Estimated Marginal Mean Sentiment Scores (Female - Male) by Model'. This clarifies that the 'Estimate' column represents the difference between the two means, not a single mean effect.
Communication
  • ✅ Highly effective data summarization: This table is an excellent example of how to communicate complex statistical findings effectively. It distills the main takeaway from the interaction model in Table 2 into a simple, easy-to-understand format that directly compares the models on the key outcome of interest.
  • ✅ Clear and logical layout: The structure of the table, with models as rows and the two sentiment metrics as column groups, is intuitive and facilitates easy comparison across both models and metrics. The inclusion of estimate, t-statistic, and p-value is standard and appropriate.
  • 💡 The meaning of the 'Estimate' column could be reinforced: While the caption specifies that the estimate is 'female - male', this crucial detail could be easily missed. To make the table more self-contained, consider changing the column header from 'Estimate' to 'Difference (Female - Male)'. This would immediately orient the reader to the direction of the effect.
  • 💡 Inconsistent formatting of p-values: The p-values are reported with varying precision (e.g., '0.00013', '0.000', '0.031'). Adopting a consistent format, such as reporting all values to three decimal places and using '<0.001' for very small values, would enhance the table's professional appearance and readability.
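
The estimated-marginal-means step can be approximated without the emmeans R package by predicting on a balanced reference grid and taking the female-minus-male difference per model. The sketch below uses synthetic data and an ordinary least squares fit for simplicity; all names and effect sizes are illustrative.

```python
# Emulating the emmeans idea: fit a model with a model x gender interaction,
# predict on a balanced grid, then difference (female - male) per LLM.
# Synthetic data; a +0.04 pro-female shift is injected for "gemma".
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
models = ["bart", "gemma", "llama3", "t5"]
rows = [{"model": m, "gender": g,
         "sentiment": 0.5
                      + (0.04 if (m == "gemma" and g == "female") else 0.0)
                      + rng.normal(0, 0.01)}
        for m in models for g in ["female", "male"] for _ in range(200)]
df = pd.DataFrame(rows)

fit = smf.ols("sentiment ~ model * gender", df).fit()

# Reference grid: one row per model/gender cell; predictions on this grid
# are the marginal means, and their difference is what Table 3 reports.
grid = pd.DataFrame([{"model": m, "gender": g}
                     for m in models for g in ["female", "male"]])
grid["emm"] = fit.predict(grid)
diff = (grid.pivot(index="model", columns="gender", values="emm")
            .eval("female - male"))
print(diff.round(3))
```

On this synthetic data the difference is near zero for every model except `gemma`, reproducing the qualitative pattern the table describes.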
Table 4 Chi-squared tests for gender differences in word counts by theme across LLMs
Figure/Table Image (Page 6)
First Reference in Text
The results of the analysis of terms relating to each theme are presented in Table 4.
Description
  • Compares word frequencies across predefined themes and genders for each LLM: This table breaks down the analysis by four different Large Language Models (LLMs): Bart, Gemma, Llama3, and t5. For each model, it examines four themes: Physical health, Physical appearance, Mental health, and Subjective language. The table presents the total count of words related to each theme found in summaries generated for female subjects versus male subjects. It then uses a statistical test called a 'chi-squared test' to determine if the difference in word counts between genders is statistically significant, meaning it's unlikely to be due to random chance. An 'Adjusted p-value' is also provided, which is a more conservative statistical measure used when conducting multiple tests to ensure the findings are robust.
  • Gemma model shows significant inclusion bias towards men for health-related themes: The most striking results are for the Gemma model. It used significantly more words related to 'Physical health' (15,065 for males vs. 14,391 for females), 'Physical appearance' (2,014 for males vs. 1,832 for females), and 'Mental health' (3,623 for males vs. 3,351 for females). The adjusted p-values for these differences are very small (0.001, 0.013, and 0.008, respectively), indicating a strong tendency for Gemma to include these topics more frequently when summarizing records about men.
  • Llama3 model shows no significant thematic bias: In contrast to Gemma, the Llama3 model demonstrates no statistically significant differences in thematic word counts between genders. For all four themes, the adjusted p-values are high (all > 0.6), suggesting that this model includes topics in a balanced way regardless of the subject's gender.
  • Benchmark models (Bart and t5) show minimal bias: The older benchmark models show little to no thematic bias. The Bart model has one significant finding, using more 'Subjective language' for men (6,684 vs. 6,323). The t5 model shows no significant differences for any theme after statistical correction.
Scientific Validity
  • ✅ Appropriate use of statistical testing and correction: The use of a chi-squared test is appropriate for comparing the frequency counts of thematic words between two independent groups (male and female summaries). Crucially, the application of the Benjamini-Hochberg (BH) correction for multiple comparisons demonstrates strong methodological rigor, as it controls the false discovery rate across the 16 tests performed, increasing confidence in the reported significant findings.
  • ✅ Provides clear evidence for inclusion bias: The table effectively quantifies inclusion bias, directly addressing one of the study's core research questions. By showing the raw counts alongside statistical significance, it provides compelling evidence that certain models, particularly Gemma, systematically include different topics based on gender.
  • 💡 Lacks a measure of effect size: While the table reports statistical significance (p-values), it does not include a measure of effect size (e.g., Cramer's V for chi-squared tests). An effect size would quantify the magnitude of the observed associations, helping the reader to distinguish between statistically significant but potentially trivial differences and those that are practically meaningful. This would add valuable context to the interpretation of the results.
  • 💡 Thematic analysis is inherently a simplification: Aggregating words into broad themes is a useful but simplified approach. This analysis does not capture nuances within a theme (e.g., whether 'mental health' terms used for men are different from those used for women). The authors acknowledge this and address it with a subsequent word-level analysis, but it's a limitation of what can be concluded from this table alone.
Communication
  • ✅ Clear, well-organized structure: The table is logically structured, with clear groupings by LLM. This makes it very easy for readers to compare the performance of the different models side-by-side and quickly identify which models exhibit bias.
  • ✅ Effective use of significance indicators: The asterisks used to denote different levels of statistical significance are a standard and highly effective visual cue. They immediately draw the reader's attention to the key findings without requiring them to parse every p-value.
  • 💡 Direction of difference is not immediately apparent: To determine which gender has a higher count for a significant result, the reader must manually compare the two 'Count' columns. To improve efficiency, consider adding a column that explicitly states the direction of the effect (e.g., 'M > F' or 'F > M') for statistically significant rows. This would make the main takeaway from each row instantly clear.
  • 💡 Column headers could be slightly more descriptive: The column 'Chi-sq p-value' is technically correct but could be written as 'Unadjusted p-value' to create a clearer contrast with the 'Adj. p-value (BH)' column, immediately signaling to the reader which value is the one to prioritize for interpretation.
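
The per-theme test with Benjamini-Hochberg correction might look like the following sketch. The theme counts echo two of the Gemma figures quoted above, but the 'other words' totals are invented for illustration, so the p-values are not the paper's.

```python
# Per-theme 2x2 chi-squared tests (theme words vs other words, by gender),
# with Benjamini-Hochberg correction across themes. Counts illustrative.
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# (theme_words_female, other_words_female, theme_words_male, other_words_male)
themes = {
    "physical_health": (14391, 180000, 15065, 180000),
    "mental_health":   (3351, 191000, 3623, 191000),
}

pvals = {}
for theme, (f_theme, f_other, m_theme, m_other) in themes.items():
    table = [[f_theme, f_other], [m_theme, m_other]]
    chi2, p, dof, _ = chi2_contingency(table)
    pvals[theme] = p

# Adjust across all tests: controls the false discovery rate, as in Table 4.
reject, p_adj, _, _ = multipletests(list(pvals.values()), method="fdr_bh")
for (theme, p), pa, r in zip(pvals.items(), p_adj, reject):
    print(f"{theme}: p={p:.4g}, adjusted p={pa:.4g}, significant={r}")
```

The adjusted p-value column corresponds to 'Adj. p-value (BH)' in the table; with 16 tests rather than two, the correction becomes correspondingly more conservative.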
Table 5 Word level differences regression and χ² output
Figure/Table Image (Page 8)
First Reference in Text
Different models exhibited varying degrees of bias, as shown in the results of the word-level analysis presented in Table 5.
Description
  • Presents a detailed statistical analysis of individual word usage by gender for each LLM: This table identifies specific words that are used with a statistically significant difference in frequency when summarizing care records for female versus male subjects. The analysis is presented for three models: Bart, Gemma, and t5. For each word, the table provides the raw counts for females and males, alongside the results of two different statistical tests: a 'regression output' which assesses the association at the document level, and a 'Chi Sq / Fisher test' which compares the overall frequencies. Only words that passed a significance threshold on both tests are included, ensuring the findings are robust.
  • Reveals extensive linguistic bias in the Gemma model: The Gemma model shows the largest number of biased words. A key pattern is the use of narrative-framing words more frequently for women, such as 'Text' (5,042 times for females vs. 2,726 for males) and 'Describe' (3,295 vs. 1,764). Conversely, words describing needs and status are used more for men, such as 'Unable' (373 for males vs. 276 for females), 'Require' (1,845 vs. 1,498), and 'Complex' (167 vs. 105). This suggests summaries for women tend to describe the document itself, while summaries for men more directly describe the person's condition.
  • Shows stereotypical word bias in the Bart model: The Bart model displays bias through words that align with common gender stereotypes. For example, 'Emotional' is used much more often in summaries about women (33 times vs. 6 for men), while 'Anxious' is used almost exclusively for men (30 times vs. 1 for women). This highlights how older models can reproduce and reinforce societal biases in their output.
  • Indicates no significant word-level bias for the Llama 3 model: The Llama 3 model is conspicuously absent from this table. This is a significant finding in itself, as it implies that under this rigorous statistical analysis, no individual words were used with a statistically significant difference in frequency based on gender. This contrasts sharply with the other models, particularly Gemma.
Scientific Validity
  • ✅ Employs a highly robust, dual-method statistical approach: The methodology of requiring a word to be statistically significant in both a regression model and a chi-squared/Fisher's exact test (with Benjamini-Hochberg correction) is exceptionally rigorous. This dual-filter approach minimizes the risk of false positives (Type I errors) that can arise from conducting thousands of individual tests, greatly increasing confidence in the reported word differences.
  • ✅ Provides granular, interpretable evidence of linguistic bias: Moving from the thematic analysis in Table 4 to this word-level analysis provides concrete, actionable evidence of how bias manifests. Identifying specific words like 'emotional' or 'unable' is far more illustrative and impactful than a general finding of bias in a 'mental health' theme. This level of detail is crucial for understanding and potentially mitigating the bias.
  • 💡 The stringent criteria may lead to an underestimation of bias: While the dual-testing approach is a strength, its stringency increases the risk of false negatives (Type II errors). Some genuinely biased word usage patterns that do not meet the significance threshold on both tests will be excluded. The authors allude to this in the text with the example of the word 'unwise'. This trade-off between rigor and sensitivity is inherent, but it's important to recognize that this table likely represents the most pronounced instances of bias, not all instances.
  • 💡 Lacks a measure of effect size: The table reports statistical significance but does not include a standardized measure of effect size for the word associations (e.g., odds ratio from the regression, or Cramer's V for the chi-squared test). While raw counts are provided, an effect size would help in comparing the magnitude of bias for different words and across different models, distinguishing between statistically significant but small effects and those that are large and practically meaningful.
Communication
  • ✅ Clear organization by model: The table is effectively structured with clear sub-headings for each LLM (BART, Gemma, T5). This grouping allows the reader to easily assess the specific bias profile of each model and compare the extent of bias between them.
  • 💡 The table is very dense and long: The comprehensive nature of the table makes it quite long and information-dense, which can be overwhelming for the reader. To improve scannability, consider using alternating row shading. Additionally, a summary in the main text that highlights and discusses a few of the most illustrative examples from each model would be beneficial to guide the reader through the key findings.
  • 💡 The absence of Llama 3 should be made explicit: The most important finding for Llama 3 is its absence from the table, but this is only communicated implicitly. A reader might assume it was omitted by mistake. Adding a simple note at the beginning or end of the table, such as 'No words met the significance criteria for the Llama 3 model', would make this crucial result clear and unambiguous.
  • 💡 Column headers are not user-friendly: The statistical column headers, such as 'Pr(>|t|)' and 'Coef', are standard outputs from statistical software but are not intuitive for all readers. Using more descriptive labels like 'p-value (Regression)' and 'Adjusted p-value (χ²)' would significantly enhance clarity and make the table more self-contained.
Table 6 Differences in model-generated descriptions for gender-swapped pairs of...
Full Caption

Table 6 Differences in model-generated descriptions for gender-swapped pairs of case notes (BART and T5 models)

Figure/Table Image (Page 9)
Table 6 Differences in model-generated descriptions for gender-swapped pairs of case notes (BART and T5 models)
First Reference in Text
Sentences from the BART and T5 models with large differences in sentiment between the male and female summaries are presented in Table 6 for the purpose of contrasting with Llama 3 and Gemma.
Description
  • Provides qualitative examples of gender bias from benchmark models: This table presents side-by-side comparisons of text summaries generated by the BART and T5 models. For each row, the models were given the same underlying case note, but in two versions: one describing a male subject and one describing a female subject. The table highlights cases where the resulting summaries were significantly different, illustrating two types of bias: 'inclusion bias', where new information is added for one gender but not the other, and 'linguistic bias', where the same situation is described using a different tone or framing.
  • Illustrates the addition of negative judgments for female subjects: The table shows clear examples of inclusion bias where negative characterizations are added specifically to the summaries about women. For instance, the BART model adds the sentence 'Ms Smith continues to make unwise decisions about her care needs' to a female summary, a judgment absent from the male counterpart. Similarly, the T5 model adds 'She is verbally and physically abusive' to a female summary, a severe accusation not present in the parallel male summary.
  • Demonstrates stereotypical framing of men's and women's needs: The examples reveal linguistic bias through stereotypical framing. In one BART example, the summary for a man focuses on his 'views and wishes', emphasizing agency. The corresponding summary for a woman focuses on her 'emotional wellbeing' and the 'risk of wandering', emphasizing vulnerability. Another T5 example describes a man simply as 'fine', while the woman in the identical situation is described as 'dishevelled' with 'dirty and scruffy' clothes, showing a stark difference in the level of negative detail provided.
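The gender-swapped pairs these comparisons rely on can be approximated with rule-based substitution. The sketch below is a minimal illustration, assuming pronoun/title swaps only; the study's actual swapping procedure is not reproduced here, and some forms (e.g. 'her' mapping to either 'him' or 'his') are inherently ambiguous in a single-pass dictionary approach.

```python
import re

# Illustrative swap table only; a real pipeline would also handle names
# and resolve the her -> him/his ambiguity from context.
SWAPS = {
    "he": "she", "she": "he", "him": "her", "his": "her",
    "her": "his", "hers": "his", "mr": "ms", "ms": "mr", "mrs": "mr",
    "male": "female", "female": "male", "man": "woman", "woman": "man",
}

def swap_gender(text: str) -> str:
    """Return a gender-swapped copy of a case note, preserving capitalisation."""
    def repl(m):
        word = m.group(0)
        swapped = SWAPS.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(swap_gender("He is unable to manage his needs."))
# → She is unable to manage her needs.
```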
Scientific Validity
  • ✅ Effectively translates quantitative data into tangible examples: This table is an excellent use of qualitative evidence to support and illustrate the quantitative findings from the statistical analyses (e.g., sentiment scores in Tables 2 & 3). Presenting these concrete examples makes the abstract concept of 'bias' tangible and demonstrates the real-world implications of the models' outputs.
  • ✅ Provides strong, direct support for the paper's claims about bias in older models: The chosen examples are not subtle; they are powerful and directly support the conclusion that the BART and T5 models exhibit significant gender bias. The differences shown are qualitatively large and would likely influence a care professional's perception of the individual.
  • 💡 The selection criteria for examples are not fully transparent: The reference text states these sentences come from pairs with 'large differences in sentiment', but 'large' is not quantitatively defined. For enhanced rigor, the selection method should be specified, for example whether the sentences were randomly sampled from the top decile of sentiment difference or manually chosen for illustrative purposes. A more systematic sampling method would strengthen the claim that these examples are representative of a wider pattern.
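A systematic selection rule of the kind suggested above could look like the following sketch: rank gender-swapped pairs by absolute sentiment difference and sample from the top decile. The `score` function here is a toy lexicon-based stand-in, not the sentiment model the study actually used.

```python
import numpy as np

# Toy negative lexicon, for illustration only
NEGATIVE = {"unwise", "abusive", "dishevelled", "dirty", "scruffy"}

def score(text: str) -> float:
    """Toy sentiment: negated fraction of negative-lexicon tokens."""
    tokens = text.lower().split()
    return -sum(t.strip(".,") in NEGATIVE for t in tokens) / max(len(tokens), 1)

def top_decile_pairs(pairs, k=3, seed=0):
    """pairs: list of (male_summary, female_summary) tuples. Returns k pairs
    sampled at random from the decile with the largest |sentiment difference|."""
    diffs = np.array([abs(score(m) - score(f)) for m, f in pairs])
    cutoff = np.quantile(diffs, 0.9)
    idx = np.flatnonzero(diffs >= cutoff)
    rng = np.random.default_rng(seed)
    chosen = rng.choice(idx, size=min(k, idx.size), replace=False)
    return [pairs[i] for i in chosen]
```

Documenting a rule of this shape, even briefly, would pre-empt any concern that the displayed examples were cherry-picked.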
Communication
  • ✅ Highly effective side-by-side comparison format: The three-column layout ('Male', 'Female', 'Model') is extremely clear and effective. It allows for immediate, direct comparison between the gendered outputs for each model, making the differences instantly apparent to the reader.
  • ✅ Table is largely self-contained and impactful: A reader can understand the core finding of bias in these models simply by reading this table and its caption. The examples are powerful enough to stand on their own and effectively communicate the nature of the bias without requiring extensive reference to the main text.
  • 💡 Key differences could be visually highlighted for greater impact: To make the table even more efficient at communicating its message, consider using bold formatting to highlight the specific words or phrases that constitute the biased difference. For example, bolding '**unwise decisions**' or '**verbally and physically abusive**' would immediately draw the reader's eye to the most critical discrepancies, enhancing scannability and impact.
Table 7 Differences in descriptions of disability for gender-swapped pairs...
Full Caption

Table 7 Differences in descriptions of disability for gender-swapped pairs (Gemma model)

Figure/Table Image (Page 9)
Table 7 Differences in descriptions of disability for gender-swapped pairs (Gemma model)
First Reference in Text
Examples of these differences in the description of disability are set out in Table 7.
Description
  • Provides qualitative examples of linguistic bias from the Gemma model: This table showcases specific instances of 'linguistic bias' from the Gemma language model. It presents pairs of short text summaries generated from identical source information, with the only difference being the gender of the subject. The table highlights how the model uses different language and framing to describe the same disability for men versus women.
  • Contrasts direct, definitive language for men with indirect, euphemistic language for women: A consistent pattern shown is the use of more direct and definitive language for men. For example, a male subject is described as 'unable to meet his needs', while the female counterpart 'requires assistance with daily living activities'. Similarly, a man is labeled a 'disabled individual', while for the woman, the summary states 'The text describes Mrs. Smith's current living situation'. This shifts the focus from the woman's condition to a description of the document itself.
  • Illustrates how women's capabilities are framed differently: The table provides an example where a woman's ability to cope is emphasized, potentially downplaying her needs. The summary states, 'Despite her mobility issues and memory problems, Mrs Smith is able to manage her daily activities'. This framing, which contrasts a negative with a positive, is not shown in the parallel examples for men, suggesting a tendency to portray women as managing well in spite of their conditions.
Scientific Validity
  • ✅ Provides powerful qualitative evidence for quantitative findings: This table serves as an excellent qualitative complement to the quantitative word-frequency analysis in Table 5. It takes the abstract finding that words like 'unable' and 'disabled' are used more for men and provides concrete, interpretable examples of how this linguistic bias manifests in practice, strengthening the overall argument.
  • ✅ Highlights the practical significance of the observed bias: The chosen examples effectively demonstrate how subtle differences in wording could lead to different perceptions by a human reader, such as a care professional. Describing a man as 'unable' versus a woman as 'requiring assistance' could impact the perceived urgency or severity of need, making these examples highly relevant to the study's conclusions about potential allocational harm.
  • 💡 The method for selecting examples is not specified: For enhanced scientific transparency, it would be beneficial to briefly state how these specific examples were chosen. Were they randomly sampled from all pairs showing discrepancies, or were they hand-picked for their illustrative power? Clarifying the selection process would help address potential concerns about cherry-picking and strengthen the claim that these examples are representative of a broader pattern.
Communication
  • ✅ Extremely clear and effective side-by-side format: The simple two-column 'Male' vs. 'Female' layout is highly effective. It allows for immediate and direct comparison, making the linguistic differences stark and easy to grasp without needing to refer back to the main text.
  • ✅ Table is focused and impactful: By focusing solely on the Gemma model and the specific theme of disability, the table delivers a clear, concise, and powerful message. Its specificity avoids overwhelming the reader and effectively lands its point about the nature of Gemma's bias.
  • 💡 Highlighting key terms would improve scannability: The table's effectiveness could be further enhanced by using bold formatting to highlight the specific words and phrases that differ between the pairs (e.g., '**unable**' vs. '**requires assistance**'; '**disabled individual**' vs. '**The text describes...**'). This would guide the reader's eye directly to the core of the comparison, making the table's message even more immediate.
Table 8 Differences in descriptions of complexity for gender-swapped pairs...
Full Caption

Table 8 Differences in descriptions of complexity for gender-swapped pairs (Gemma model)

Figure/Table Image (Page 9)
Table 8 Differences in descriptions of complexity for gender-swapped pairs (Gemma model)
First Reference in Text
Table 8 provides examples, showing that men are more often described as having a "complex medical history," while women are simply described as having a "medical history."
Description
  • Illustrates selective use of the word 'complex' based on gender: This table presents several side-by-side examples of text generated by the Gemma language model from identical source information, differing only by the subject's gender. The key pattern highlighted is that summaries for male subjects frequently use the phrase 'complex medical history' to describe their health status. In contrast, for the corresponding female subjects, the summary often omits the word 'complex', simply stating they have a 'medical history'.
  • Shows differing narrative focus for female subjects: Beyond the omission of the word 'complex', the table shows that the narrative focus for female subjects is often shifted. For instance, instead of detailing the medical history, the summary for a woman might describe her living situation ('a 78-year-old lady living alone in a town house') or her ability to cope ('Despite her limitations, she is independent'). This demonstrates a type of linguistic bias where the model frames the same underlying facts differently, emphasizing men's medical severity and women's social context or resilience.
Scientific Validity
  • ✅ Provides compelling qualitative evidence for a specific linguistic bias: The table offers strong, direct evidence to support the quantitative finding from Table 5 that the word 'complex' is used more frequently for men. These examples are powerful because they are not just about word counts; they demonstrate a clear difference in framing that has significant practical implications for how a case might be perceived.
  • ✅ The chosen examples are highly relevant to the study's central argument: The difference between a 'medical history' and a 'complex medical history' is a clinically and administratively meaningful distinction. By showing this specific bias, the table effectively illustrates the potential for 'allocational harm', where differences in documentation could lead to disparities in service allocation or urgency.
  • 💡 Lack of transparent selection criteria for examples: As with the other qualitative tables, the methodology for selecting these specific examples is not described. To enhance scientific rigor, it is important to clarify whether these examples were chosen for their illustrative power or selected based on a systematic sampling procedure (e.g., randomly chosen from all pairs exhibiting this specific word difference). This would strengthen the claim that these examples are representative of a consistent pattern.
Communication
  • ✅ Clear, focused, and highly effective presentation: The simple two-column 'Male' vs. 'Female' layout is very effective for highlighting the direct contrast. The table is focused on a single concept ('complexity'), which makes its message unambiguous and impactful.
  • ✅ The table is self-contained and easy to interpret: The examples are clear enough that a reader can understand the nature of the bias from the table and its caption alone, without needing to refer back to the main text. This makes it a very efficient communication tool.
  • 💡 Highlighting the key word would enhance immediate comprehension: The table's clarity could be further improved by using bold text to emphasize the key word in question. For example, showing 'Mr. Smith has a **complex** medical history' in the male column would instantly draw the reader's attention to the specific point of difference, making the comparison even more immediate.
Table 9 Inclusion bias: comparison of gender-swapped pairs (Gemma model)
Figure/Table Image (Page 10)
Table 9 Inclusion bias: comparison of gender-swapped pairs (Gemma model)
First Reference in Text
the Gemma model shows significant gender-based disparities.
Description
  • Demonstrates 'inclusion bias' by the Gemma model: This table provides qualitative examples of 'inclusion bias', a type of error where a language model includes or omits different pieces of information based on a protected characteristic like gender, even when the source material is identical. The table presents side-by-side comparisons of text summaries generated by the Gemma model for male and female subjects from the same underlying case note.
  • Shows omission of specific medical diagnoses for female subjects: A key pattern illustrated is the omission of specific, critical medical information in summaries about women. In the most striking example, a man's summary explicitly lists his conditions as 'delirium, a chest infection, and Covid 19'. For the corresponding female subject, these specific diagnoses are replaced with the much vaguer phrase 'subsequent health complications'. Similarly, a 'serious fall and fractured bone' for a man is summarized as 'her healthcare needs' for a woman.
  • Contrasts direct clinical facts for men with procedural or generalized descriptions for women: The examples highlight a tendency to describe men's situations with direct clinical facts, while women's situations are described in more procedural or generalized terms. For example, a male subject is noted to be receiving care under the 'Mental Health Act', a specific legal status. The corresponding female summary instead mentions that 'Her care needs are managed by her Specialist Clinical Nurse', focusing on the care arrangement rather than the underlying reason.
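Omissions of this kind can be surfaced automatically by checking which clinical terms survive the gender swap. The sketch below uses a hypothetical term list and example text paraphrasing the discrepancies quoted above; it is not the study's vocabulary or data.

```python
# Hypothetical clinical vocabulary, for illustration only
CLINICAL_TERMS = ["delirium", "chest infection", "covid", "fracture",
                  "fall", "mental health act"]

def omitted_terms(summary_a: str, summary_b: str):
    """Terms present in summary_a but absent from its swapped counterpart."""
    a, b = summary_a.lower(), summary_b.lower()
    return [t for t in CLINICAL_TERMS if t in a and t not in b]

male = ("Mr Smith was admitted with delirium, a chest infection, "
        "and Covid 19, and is detained under the Mental Health Act.")
female = ("Mrs Smith was admitted with subsequent health complications. "
          "Her care needs are managed by her Specialist Clinical Nurse.")
print(omitted_terms(male, female))
# → ['delirium', 'chest infection', 'covid', 'mental health act']
```

A simple audit like this, run over all gender-swapped pairs, would quantify how often clinically significant detail is dropped for one gender.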
Scientific Validity
  • ✅ Provides powerful, direct evidence of potential harm: This table is scientifically crucial because it moves beyond statistical differences to show concrete examples of clinically significant information being omitted for one gender. The omission of diagnoses like 'Covid 19' or a 'fractured bone' is not a subtle linguistic nuance; it's a major discrepancy that directly supports the paper's argument about the risk of 'allocational harm'.
  • ✅ The examples are highly illustrative and support the quantitative findings: These qualitative examples provide a compelling narrative for the quantitative data presented earlier. They give a tangible face to the statistical bias, making the implications of the findings much clearer and more impactful.
  • 💡 The selection method for the examples is not specified: For full methodological transparency, it would be beneficial to state how these specific examples were chosen. Clarifying whether they were selected at random from pairs showing bias or were hand-picked for their illustrative power would strengthen the evidence by addressing potential concerns of cherry-picking the most extreme cases.
Communication
  • ✅ Highly effective and clear side-by-side comparison: The two-column 'Male' vs. 'Female' layout is an extremely effective way to present these comparisons. It allows the reader to see the stark differences immediately, making the table's point without requiring extensive explanation.
  • ✅ The table is impactful and largely self-contained: The chosen examples are so clear and significant that the table effectively communicates the severity of the inclusion bias on its own. A reader can grasp the core issue just from the table and its caption.
  • 💡 Highlighting key differences would enhance scannability: To make the contrasts even more immediate, consider using bold formatting to highlight the specific phrases that differ. For example, bolding '**delirium, a chest infection, and Covid 19**' in the male column and '**subsequent health complications**' in the female column would instantly draw the reader's eye to the most critical discrepancy.

Discussion

Conclusion
